Abstract: This article presents an asynchronous FPGA
architecture for implementing cryptographic algorithms
secured against physical cryptanalysis. We discuss the
suitability of asynchronous reconfigurable architectures
for such applications before proceeding to model the
side channel and define our objectives. The logic block
architecture is presented in detail. We discuss several
solutions for the interconnect architecture, and how these
solutions can be ported to other flavours of interconnect
(i.e. single driver). Next we discuss in detail a high speed
asynchronous configuration chain architecture used to
configure our asynchronous FPGA, with simulation results,
and we present a 3 x 3 prototype FPGA fabricated in
65 nm CMOS. Lastly we present experiments to test
the high speed asynchronous configuration chain and
evaluate how far our objectives have been achieved with
the proposed solutions, and we conclude with emphasis on
complementary FPGA CAD algorithms and the effect of
CMOS variation on side-channel vulnerability.

Key-words: FPGA Structure, Asynchronous Logic, 
Secure Applications, Side-Channel Attacks, Native 
Countermeasures. 

I. Introduction 

Cryptography is a means to defend against potential
attackers, notably to protect confidentiality, integrity or
secure authentication, whereas cryptanalysis is about the
challenge of retrieving hidden information. There are no
known mathematical cryptanalysis methods which can
break standard cryptographic algorithms like AES in a
reasonable amount of time and space, even assuming that the
cryptanalyst has access to both plain-text and encrypted
messages. However, all such algorithms are implemented
with some physical process that leaks information.
Access to this information makes the job of the
cryptanalyst much easier. This kind of information leakage
from physical processes is commonly known as
side-channel leakage.



For the purpose of this article, we divide cryptanalysis
broadly into two categories: mathematical and physical.
We assume that the cryptographic algorithms concerned
are secure at the mathematical level, and we
specifically address the issue of physical cryptanalysis
and countermeasures. Physical cryptanalysis can again
be of two types, namely active and passive. Injecting
faults to perturb the physical implementation is an
example of an active attack, whereas attacks based on measuring
power consumption or electromagnetic (EM) radiation are
examples of passive attacks, commonly known as
Side-Channel Attacks (SCAs).

Physical cryptanalysis has been demonstrated to be
effective against various standard algorithms, and on
various platforms, in recent times. Researchers have
shown that side-channel attacks can be mounted on
standard cryptographic algorithms like DES [?], AES [?] and
RSA [?]. References [54], [?] provide the details
of such attacks on FPGA implementations, whereas
various attacks [51] have been reported on ASIC
implementations. A widely known SCA is DPA (Differential Power
Analysis) [32], which exists in various forms [12] and
concerns the information leaked through supply current
peaks. Attacks which exploit the Electromagnetic
Emissions (EMA) [?] from the hardware constitute another
major branch of Side-Channel Attacks. Attacks on
RSA which use the difference in execution time as their
major source of information have also been reported,
and these are commonly known as Timing Attacks [19].
The reader can also find a comprehensive report of
active attack details in [5].

Now then, who's at risk? An evident answer
is banking applications. Credit cards use
algorithms similar to RSA for authentication, and 2-key
Triple DES for the challenge [52]. Systems which rely on
smart cards for their security (e.g. Pay-TV) and are exposed
to wholesale fraud could well be a target of such attacks. Mounting
a side-channel attack calls for considerable expertise
and high-resolution equipment, so such techniques are
likely to be used when there is a considerable gain. For
this reason, a major threat concerns intellectual property
protection. The attacker can gain
access to piracy protection devices embedded in
commercial systems, and steal the IPs. All the more reason
to incorporate side channel resistance into these systems.

In the rest of the article we proceed in a top-down
fashion. Section II presents asynchronous circuits
and protocols, and the salient points of this technology.
Section III provides a brief overview and classification
of side-channel attacks, the assumed models and a
classification of countermeasures. In Section IV we list
the features of asynchronous reconfigurable circuits that
make them especially suitable for security applications.
Once the reader gains an understanding of where this
article is situated among these vast interacting domains,
we provide a model for the side channel in Section V
and set our objectives. We present the logic block
architecture of our asynchronous FPGA in Section VI.
Section VII addresses the issue of interconnect design,
which makes up most of the area in an FPGA, and
Section VIII presents the method to port these solutions
to the new single driver architecture. In Section X we
present a prototype asynchronous FPGA and Section XI
presents the evaluation of the proposed solutions based
on experiments. Section XII presents the conclusions
from this research effort.

II. Asynchronous Circuits 

In this section we discuss the key ideas of
asynchronous logic. We refer the reader to
the following publications for more detailed
discussions: [24], [13], [44]. Figure 1(b) shows the basic
asynchronous handshake protocol. The sender puts valid
data on the data line and sends a request on the REQ line
to show the validity of the data. Once the receiver has read
the data, it asserts the ACK line so that the sender can put
new data on the line. This basic protocol is formally defined
as production rules or Seitz's weak conditions [39].

However, this basic scheme is not hazard free. For
example, if the DATA line has more delay than the REQ line,
it is possible that the receiver gets the request before
the DATA is valid. To avoid such hazards, DATA
and REQ are often encoded into a single channel. The most
common encodings are 1-out-of-n encodings, similar to
one hot codes for finite state machines. Figure 1(c)
depicts the 1-out-of-2 encoding. Table I describes the
encoding scheme. Apart from encodings, the signalling
protocol can also be of various flavours. The popular
four phase protocol uses one phase for computation,
and one phase to precharge all signals to the zero state.

Figure 1. Asynchronous protocols: Figure 1(a) shows the
synchronous methodology where data is valid only at the positive
edge of the clock signal. Figure 1(b) shows the handshaking in
the asynchronous protocol; the arrows show the causality of the request
and acknowledgement events.

Table I

1-OUT-OF-2, 4-PHASE PROTOCOL.

Value        DATA0   DATA1
'0'          '1'     '0'
'1'          '0'     '1'
Precharge    '0'     '0'
Forbidden    '1'     '1'
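
As an illustration of Table I, the following Python sketch encodes a bit stream with the 1-out-of-2, 4-phase protocol and decodes it again. The function names (`encode_bit`, `four_phase_sequence`, `decode_sequence`) are ours and purely illustrative; code words are written as (DATA0, DATA1) tuples, and the handshake timing itself is not modelled.

```python
# Code words of Table I, written as (DATA0, DATA1).
PRECHARGE = (0, 0)
FORBIDDEN = (1, 1)


def encode_bit(value):
    """Return the (DATA0, DATA1) code word for a logical 0 or 1."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return (1, 0) if value == 0 else (0, 1)


def four_phase_sequence(bits):
    """Interleave each valid code word with a return to the precharge state."""
    sequence = [PRECHARGE]
    for bit in bits:
        sequence.append(encode_bit(bit))
        sequence.append(PRECHARGE)
    return sequence


def decode_sequence(sequence):
    """Recover the bits; precharge words carry no data, (1, 1) is an error."""
    bits = []
    for word in sequence:
        if word == FORBIDDEN:
            raise ValueError("forbidden code word (1, 1)")
        if word != PRECHARGE:
            bits.append(0 if word == (1, 0) else 1)
    return bits
```

Because validity is carried by the code word itself, a receiver needs no separate REQ wire: any word other than (0, 0) signals that data has arrived.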



In contrast, 2-phase protocols (NRZ) do not use
precharge. Details about 2-phase protocols can be found
in [?], [?].

Since in an asynchronous signalling scheme each event
has significance (as opposed to synchronous logic, where
a glitch can occur without disturbing the functionality),
asynchronous protocols are completely glitch free.
Glitches result from unbalanced input path arrival
times (unbalanced joins). In asynchronous circuits, this
hazard is taken care of with C-Elements [40] at the
gate level; at the transistor level, however, there is an
additional constraint that forks be balanced in delay. This
constraint is commonly known as the "isochronous fork"
constraint [24], and such asynchronous circuits are called
Quasi-Delay Insensitive (QDI) asynchronous circuits.
In this article we assume QDI asynchronous circuits
implicitly whenever we discuss asynchronous protocols.

The asynchronous FPGA architectures proposed
in the literature often use the properties of asynchronous
logic for high performance (high speed, low power,
robustness). Typical architectures divide into several
categories: fine grain [48], [49], [47], coarse grain [23],
[33], [?] and GALS [15]. Indeed, with asynchronous
logic, glitch-free operation and the absence of a clock
network can substantially reduce power consumption, and
the slack elasticity [?] of asynchronous logic can increase
throughput. The architecture presented in this article
focuses on resistance against physical cryptanalysis,
while enjoying the above-mentioned benefits of being
asynchronous.

In the most recently proposed asynchronous FPGA
architecture [48], [49], [47], the logic block is designed
assuming the 1-out-of-2 4-phase protocol, and uses pipelines
in the routing switches to increase throughput. Routing
segments are a 3-wire bundle (DATA0, DATA1, ACK).
Our logic block architecture is much more fine grain, to
accommodate a plethora of encoding schemes and styles,
and the routing architecture consists of single wires, on which
dual-rails are routed together. This is done keeping in
mind the prototyping role of an FPGA, and the flexibility
required for dynamic countermeasures to operate. Since
we do not know of any future-proof solution to resist
physical cryptanalysis, this architecture provides the
designer with a soft fine grain fabric on which he/she can
implement a mix of dynamic and static countermeasures
pertinent to the application. In Section XI we will discuss
the additional cost to be paid for this added flexibility.

III. Side Channel Attacks

Side-Channel Attacks are very similar to spectroscopy
(NMR), used over the years. While in
spectroscopy, patterns in the light spectrum are used to
detect the presence of atoms and their environments in
an unknown substance, in a Side-Channel Attack the
cryptanalyst looks for patterns in the power consumption
or EM emission to detect the unknown key value.

We classify Side-Channel Attacks in two ways. They
are either based on the acquisition method:

• Supply current measurement (DPA)

• EM emission measurement (EMA)

• Timing difference measurement (Timing Attack)

or based on processing methods:

• Correlation based. (The measured traces are
correlated with the trace predicted from an assumed model.)

• Template. (The model itself is created from
experimental measurements on one sample, which is then
used to predict the traces for clone circuits [?].)

We will give a very basic example of how a side-channel
attack is carried out. A broad overview can be found
in [51]. Referring to Fig. 1(a), we can see that in
unprotected synchronous logic, each change of state of
a signal can be distinguished by a current spike in the
power supply line, and no change of state by the absence
of a current spike. Given this basic behaviour of CMOS
circuits, a side-channel cryptanalyst could proceed in the
following fashion:

• He finds a signal which is a function of, say, N bits
of the key and M bits of the input message.

• He performs the encryption $2^M$ times, once for each
different message value, and acquires the power trace of each
of them.

• He makes $2^N$ key guesses and, for each key guess,
he makes current spike predictions for the $2^M$ traces.

• Among these $2^N$ current spike predictions for the $2^M$
encryptions, the one which correlates best with the
measured power traces is the correct key guess.
Indeed, asymptotically the prediction matches the real
observation only for the correct key guess.
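
The steps above can be sketched in Python on a toy target. This is a minimal illustration under our own assumptions, not the attack setup used in this article: the targeted signal is the output of a 4-bit S-box (we borrow the PRESENT S-box as a stand-in non-linear function), N = M = 4, the leakage is assumed to follow the Hamming weight of that output, and the "measurement" is simulated with small bounded noise. All function names are hypothetical.

```python
import random

# Stand-in non-linear function: the 4-bit S-box of the PRESENT cipher.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def predict(message, key_guess):
    """Predicted leakage: Hamming weight of the targeted signal (assumed model)."""
    return bin(SBOX[message ^ key_guess]).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def acquire_traces(secret_key, messages, rng):
    """Simulated measurement: one leakage sample per encryption, plus noise."""
    return [predict(m, secret_key) + rng.uniform(-0.05, 0.05) for m in messages]


def recover_key(messages, traces, n_key_bits=4):
    """Try all 2^N key guesses and keep the one that correlates best."""
    scores = {}
    for guess in range(1 << n_key_bits):
        predictions = [predict(m, guess) for m in messages]
        scores[guess] = pearson(predictions, traces)
    best_guess = max(scores, key=scores.get)
    return best_guess, scores


if __name__ == "__main__":
    messages = list(range(16))                  # all 2^M message values
    traces = acquire_traces(0xB, messages, random.Random(0))
    best, _ = recover_key(messages, traces)
    print(f"recovered key nibble: {best:#x}")
```

In a real attack each trace is a waveform with many time samples, and the correlation is computed per sample; the single-sample model here only shows why the correct guess stands out.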

The activity of other nodes is not correlated, since
they are not a function of the targeted message and
key bits. The activity of these other nodes appears as noise
after the processing of acquired traces. Resistance to
physical cryptanalysis relies on minimizing this signal
to noise ratio (SNR), either by static or dynamic
countermeasures. Figure 2(a) shows the raw power traces
after acquisition on an ASIC implementation of DES,
and figure 2(b) shows the appearance of the predicted peak
when correlated with the right key guess.



A. Countermeasures 

The reported countermeasures to side-channel attacks
can be classified as dynamic countermeasures
(incorporated at run-time) and static countermeasures
(incorporated at design time).

1) Dynamic Countermeasures: The principle of
dynamic countermeasures is to decorrelate computing from
the power supply current by randomising transitions,
commonly known as "masking". This can be done either by
precharging signals with random values, or by introducing
random delays in computing paths [?], [31], [42]. These
masking techniques are introduced at the algorithmic
level. Details about implementing and attacking such
countermeasures can be found in [34], [38].

(a) Raw power traces after acquisition.

(b) After Processing.

Figure 2. Correlation peaks appear for the right guess: Figure 2(a)
shows the power consumption traces of a complete DES encryption
process. This raw power trace is analysed for 64 possible sub-key
guesses. Figure 2(b) is the resulting waveform for the right guess for
two power predictions.



2) Static Countermeasures: Static countermeasures
rely on producing a constant power consumption profile,
independent of the data being computed. This is
commonly done with differential signalling with a precharge
to '0', where the power consumption profile of one rail hides
that of the other. Examples of such countermeasures are
WDDL [50], Backend Duplication [16], or STTL [?].
1-out-of-n asynchronous signalling also falls in this
category. To mitigate the remaining imbalance between the
various rails, an unpredictable random switching of them
can be enforced. MDPL [36] is a typical example of such
a strategy.

IV. Suitability of Asynchronous FPGA for 
Physical Cryptanalysis Resistance 

In this section we point out the
motivations behind the architecture we are going to discuss.
The suitability of asynchronous circuits for cryptanalysis
resistance has already been investigated in [27].



• Resistance to Fault Attacks. Random introduction
of faults stalls the asynchronous circuit [28], [26],
so the cryptanalyst does not receive encrypted
messages with fault syndromes. To obtain them, faults
have to be injected very carefully, at a precise time
and location, which makes the attack considerably
more difficult.

• Absence of a Time Reference. The absence of
a reference signal (i.e. CLK) in an asynchronous
circuit prevents the attacker from assuming a precise
model for the transitions he is trying to predict, whereas
in a synchronous circuit the targeted transitions
must occur within the clock cycle. Moreover, the
power consumption of the clock signal is clearly
visible in the power trace, and provides an overall
idea of the circuit operation.

• Power Constant Signalling. As shown in
figure 1(a), a supply current spike clearly denotes
a change of state of the signal, and the absence
of a peak denotes that no change occurred in the
signal value. On the contrary, for asynchronous
1-out-of-2 signalling (see Fig. 1(c)), each valid signal
value is accompanied by one spike in the supply
current. Note that both signals are precharged to a
neutral value ("00") in between valid data. This
power constant signalling falls well into the
category of static countermeasures previously discussed
in Section III-A2.

• Absence of Glitches. In synchronous
implementations, glitches can occur without disturbing the
functionality. Glitches magnify the current spikes
shown in figure 1(a). Reference [?] discusses
the effect of glitches on Side-Channel Attacks.
As discussed in Section II, asynchronous circuits
cannot work in the presence of glitches and are
consequently less vulnerable.

• Reconfigurability. The motivation to opt for a
reconfigurable architecture rather than a hardwired
circuit is, firstly, to achieve a mix between
dynamic and static countermeasures (see Section III-A
and [10], [25]) depending on the application, and
secondly, to serve as a prototyping platform for evaluating
various asynchronous styles and/or masking techniques.
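
The power constant signalling argument can be checked with a small transition-count model. The sketch below is illustrative only: it assumes that each wire toggle costs one unit of supply current, ignores capacitance and timing differences between rails, and uses function names of our own choosing.

```python
PRECHARGE = (0, 0)  # neutral state of the (DATA0, DATA1) pair


def toggles(before, after):
    """Number of wires that change state between two code words."""
    return sum(a != b for a, b in zip(before, after))


def single_rail_transitions(bits, initial=0):
    """Toggles per bit when each bit is driven directly on a single wire."""
    counts = []
    previous = initial
    for bit in bits:
        counts.append(toggles((previous,), (bit,)))
        previous = bit
    return counts


def dual_rail_transitions(bits):
    """Toggles per bit for 1-out-of-2, 4-phase signalling.

    Each bit raises one rail from the precharge state and lowers it again,
    so the count does not depend on the data.
    """
    counts = []
    for bit in bits:
        valid = (1, 0) if bit == 0 else (0, 1)
        counts.append(toggles(PRECHARGE, valid) + toggles(valid, PRECHARGE))
    return counts


if __name__ == "__main__":
    data = [0, 1, 1, 0]
    print(single_rail_transitions(data))  # data dependent
    print(dual_rail_transitions(data))    # constant
```

Under this idealised model the single-rail count follows the data while the dual-rail count is the same for every bit; the residual leakage discussed later comes from the two rails not being physically identical.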

V. Modelling the Side Channel 

A. Dynamic Power Consumption Model 

To first order, static leakage power in CMOS is constant and
does not contain any information about the computation being
performed, hence we do not take it into
account for SCA. Power consumption is proportional
to the current charged and discharged from the power
supply.

(b) Model for a buffered net.

Figure 3. Dynamic power consumption model: Figure 3(a) shows
the power model of a CMOS logic circuit as the sum of +ve/-ve
step responses of the individual gates. Figure 3(b) shows the power
consumption model of buffered nets, considered as gates with only
one input. Unbuffered nets are included in the capacitance seen by
the gate driving that net.



We model dynamic power consumption in CMOS in 
two levels. 

• First the internal power consumption of the gate, 
which is due to charging and discharging of internal 
nets and transistor short-circuit currents inside the 
gate. 

• Secondly the power consumption of the net driven 
by the gate which also includes the input capaci- 
tances of the driven gates. 

Side-channel information is in the dynamic current
profile of the circuit, thus we need a detailed model for
this consumption. For this reason, each gate and each
net is associated with its step current response (i.e. the
contribution of the component to the current $I_{measured}$,
see fig. 3(a)).

a) Gate Level: We consider gates with N inputs.
The gate is characterized by its step current response
as the input vector undergoes a transition $i_j \to i_k$ while
the gate output is open:

$$S_{j \to k}(t) = I_{measured}(t) \qquad (1)$$

and the gate is characterized by the set of all step
responses corresponding to each transition:

$$\bigcup_{\substack{0 \le j \le 2^N - 1 \\ 0 \le k \le 2^N - 1}} S_{j \to k}(t)$$



Figure 4. Delay model for the FPGA interconnect. 



b) Net Level: Each net has only one input, hence
it is characterized by the set which contains its positive
and negative step responses,

$$\{ S_{0 \to 1}(t),\; S_{1 \to 0}(t) \}$$

obtained while the input to the net is a positive or negative step.

We consider both positive and negative step response 
because the charging and discharging network for the net 
could be different in the actual layout. 



We model a buffered net as depicted in figure 3(b). It is
a delayed sum of the step responses of each segment and
the step responses of the active gates (buffers). The point
to point delay is calculated using the widely used Elmore
delay model, as described in the next section.

1) Delay Model: To calculate the point to point delay
in the above model we use the widely used Elmore delay
model [41]. The Elmore delay at node $i$ is given by:

$$\tau_{D_i} = \sum_{k=1}^{N} R_{ik} C_k$$

where $N$ is the number of capacitances in the equivalent
network and $R_{ik}$ is given by

$$R_{ik} = \sum R_j \;\Rightarrow\; \left( R_j \in \left[ path(S \to i) \cap path(S \to k) \right] \right)$$
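
A direct transcription of the Elmore formula into Python is given below as a sketch. The RC tree representation (a `parent` map, with each node naming the resistor that feeds it) and the function names are our own assumptions for illustration; they are not taken from the tool flow used in this work.

```python
def path_to_source(node, parent):
    """Nodes on the path from `node` back to the source S (source excluded).

    Each node stands for the resistor segment that feeds it, so the returned
    list also names the resistors on path(S -> node).
    """
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def elmore_delay(target, parent, resistance, capacitance):
    """Elmore delay from the source to `target` in an RC tree.

    parent[n]      -- upstream node of n, or None if n hangs off the source
    resistance[n]  -- resistance of the segment feeding node n
    capacitance[n] -- capacitance to ground at node n
    """
    target_path = set(path_to_source(target, parent))
    delay = 0.0
    for k in capacitance:
        shared = target_path.intersection(path_to_source(k, parent))
        r_ik = sum(resistance[n] for n in shared)  # shared path resistance
        delay += r_ik * capacitance[k]
    return delay


if __name__ == "__main__":
    # S --R=1--> a --R=2--> b, with a side branch a --R=4--> c
    parent = {"a": None, "b": "a", "c": "a"}
    resistance = {"a": 1.0, "b": 2.0, "c": 4.0}
    capacitance = {"a": 1.0, "b": 3.0, "c": 2.0}
    print(elmore_delay("b", parent, resistance, capacitance))
```

For the side branch `c`, only the resistance shared with the path to the target contributes, which is exactly what the intersection in the definition of $R_{ik}$ expresses.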



B. Secure Place-Route Objectives 

1) Indiscernibility in power consumption:

a) Gate Level: In Section V-A we have modelled
the power consumption profile of a gate (a LUT in
this case) as a set of step current responses, one for each
transition. Asynchronous logic gates are mapped onto two
symmetric LUTs as shown in figure 7. According to the
1-out-of-2 4-phase protocol, only one of these two gates
will evaluate in each evaluate and precharge cycle. To
guarantee that the current consumption profiles of these
two gates are similar, we try to ensure that:

• for a LUT, each path from input to output is
indiscernible from the others.

• each path from a configuration memory point to
the output is also indiscernible from the others.

If these conditions are met, it is not possible to predict
which gate has evaluated, even if the corresponding
dual rails do not use the same inputs of the corresponding
LUTs.

Figure 5. Imbalance in capacitance and delay leaks information.

b) Net Level: Figure 5 shows the effect of
imbalance in capacitance and delay in dual-rails on side
channel leakage. Even these small differences could be
exploited for cryptanalysis [?]. In Section V-A we have
modelled each interconnect by its positive and negative
step current response. Ideally:

• the +ve and -ve step current responses of each
dual-rail should be identical.

• for a buffered net, the buffers used should be
identical, and each segment between buffers should
also have an identical step current response.

As a measure of indiscernibility, we use the
cross-correlation of the step responses of each net. The higher
the cross-correlation, the more difficult it is to predict
which net has undergone transitions.
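
As a rough sketch of this measure, the Python fragment below compares sampled step current responses with a normalised, zero-lag cross-correlation. Several things here are assumptions made for illustration: the first-order RC response used as a toy waveform, the choice of zero lag only, and the normalisation (which makes the measure blind to a pure amplitude difference between rails).

```python
import math


def step_response(r, c, n_samples=50, dt=1.0):
    """Toy step current response of an RC net: i(t) = exp(-t / RC) / R."""
    return [math.exp(-(k * dt) / (r * c)) / r for k in range(n_samples)]


def normalised_cross_correlation(a, b):
    """Zero-lag cross-correlation of two equally sampled responses.

    The result is 1.0 for waveforms of identical shape and decreases as the
    shapes diverge.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm


if __name__ == "__main__":
    rail_0 = step_response(r=1.0, c=10.0)
    rail_1_balanced = step_response(r=1.0, c=10.0)
    rail_1_unbalanced = step_response(r=1.0, c=20.0)  # twice the load
    print(normalised_cross_correlation(rail_0, rail_1_balanced))
    print(normalised_cross_correlation(rail_0, rail_1_unbalanced))
```

A full evaluation would use extracted or simulated waveforms and would also scan over time lags, since a delay mismatch between rails shifts one response relative to the other.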

However to simplify the design procedure we make 
the following assumptions: the same length of wire 
of same width, charging the same capacitances, has a 
similar step current response, that is consumes the same 
current and causes the same delay irrespective of any 
bends. 

In this respect we also define equitemporal lines for
n-wire signals. An equitemporal line is the set
of points attainable simultaneously by signals originating
from synchronized sources (i.e. wave fronts).

(a) PLB Functional Overview

(b) Block P in Detail

(c) Block X in Detail
Figure 6. PLB overview.

2) Indiscernibility in EM emission: The radiation
pattern measured at any point in space should be the
same for each wire of an n-wire bus. For example, given
a set of parallel wires (not twisted), we can choose a
point closer to one of the wires and further from the others.
At that very point in space, the radiation patterns emitted
from the different wires will be distinguishable.

At the gate level, the dual gates should be placed as 
close as possible, and at the net level we propose to route 
the dual-rail signals as a twisted pair (abridged "T-Pair") 
to deter EMA. 

VI. Logic Block Architecture 



Fig. 6(a) depicts the structure of the PLB
(Programmable Logic Block), which is able to handle the main QDI
asynchronous styles. Black triangles at the left of the LUTs
are multiplexers controlled by programming points. The
P output block is described more precisely by Fig. 6(b)
and Fig. 6(c), in which the black diamonds represent
programming points. The feedbacks from outputs to inputs
are used to obtain the memory effect.

Details of the PLB and mappings can be found in [?].
A summary of all the implementable styles in the
proposed PLB is given in Table II. The results are given in
number of PLBs. The three configurations considered in
this table are:
A: Dual Rail, 2-input gate with Acknowledge input,
B: Dual Rail, 3-input gate with Acknowledge input,
C: Triple Rail, 2-input gate with Acknowledge input.
For the purpose of this article, we explain the mapping
of 1-out-of-2, 2-input gates onto the PLB here. Other
mappings can be found in [?].

Table II

Mapping Capacity of different protocols

Protocol        A      B      C
2-phase EDGE    1      n.a.   n.a.
2-phase LEDR    0.5    1      n.a.
4-phase         0.5    1      2

Figure 7. Mapping of 1-out-of-2, 2-input Gates onto the PLB

3) 1-out-of-2, 2-Input Gates: Let $f(x, y) : F_2 \times F_2 \to F_2$
be a two-variable Boolean function. The inputs are
represented by 4 wires: $x_0$, $x_1$, $y_0$ and $y_1$, to which
a synchronization signal (Acknowledge), $S_{in}$, is added,
and the output signal $O$ is represented by two wires, $O_0$
and $O_1$, together with an acknowledge output $S_{out}$. The
individual values of these wires are functions of $x$ and
$y$, respectively denoted $f^0(x, y)$ and $f^1(x, y)$.
The equations of the outputs are:

$$
\begin{aligned}
O_1 &= \begin{cases}
f^1(x, y) & \text{if } x, y \neq (0,0) \wedge S_{in} = 0, \\
0 & \text{if } x, y = (0,0) \wedge S_{in} = 1, \\
O_1 & \text{otherwise,}
\end{cases} \\
O_0 &= \begin{cases}
f^0(x, y) & \text{if } x, y \neq (0,0) \wedge S_{in} = 0, \\
0 & \text{if } x, y = (0,0) \wedge S_{in} = 1, \\
O_0 & \text{otherwise,}
\end{cases}
\end{aligned}
\qquad (2)
$$

in which "$x, y = (0,0)$" stands for "$x = (0,0) \wedge y = (0,0)$".

Eq. (2) shows that $O_0$ and $O_1$ are functions of 6
Boolean variables. Thus the minimal practical size for
the LUTs is 64 bits, which can implement a 6-bit → 1-bit
function. As there are two output bits, the minimal size of
the PLB is 2 LUTs. The $S_{out}$ output can be computed as
$(O_1 \vee O_0)$. However, for homogeneity with the 2-phase
protocols, we use $(O_1 \oplus O_0)$ instead. Note that, as the
$O_0 = O_1 = 1$ state is forbidden, the two functions are
identical on the allowed domain.

Figure 8. LUT implementation with a wired AND.
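
The behaviour described by Eq. (2) and the $S_{out}$ function can be sketched as a small Python model. This is our reading of the equation, not the authors' implementation: we assume that the evaluation condition means both inputs carry a valid code word, that the precharge case drives both output rails to 0, and that $f^1 = f$ and $f^0 = \bar{f}$. Wire pairs are written as (rail 1, rail 0) tuples and all names are hypothetical.

```python
def decode(rail1, rail0):
    """Return 0 or 1 for a valid dual-rail pair, None otherwise."""
    if (rail1, rail0) == (0, 1):
        return 0
    if (rail1, rail0) == (1, 0):
        return 1
    return None


def gate_step(f, x, y, s_in, previous):
    """One evaluation of the dual-rail gate; returns the new (O1, O0)."""
    x_val, y_val = decode(*x), decode(*y)
    if x_val is not None and y_val is not None and s_in == 0:
        result = f(x_val, y_val) & 1
        return (result, 1 - result)   # evaluation: exactly one rail rises
    if x == (0, 0) and y == (0, 0) and s_in == 1:
        return (0, 0)                 # precharge (assumed value)
    return previous                   # memory effect through the feedback


def s_out(o1, o0):
    """Acknowledge output, computed as O1 xor O0."""
    return o1 ^ o0


if __name__ == "__main__":
    and_gate = lambda a, b: a & b
    state = (0, 0)
    state = gate_step(and_gate, (1, 0), (1, 0), 0, state)  # x = 1, y = 1
    print(state, s_out(*state))
```

The `previous` argument plays the role of the feedback wire: in every situation not covered by the first two cases, the gate simply holds its last output.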

A feedback is necessary to obtain the memory
effect. One of the inputs to the LUTs must thus be
programmable between the input to the PLB and the
feedback. As the inputs to the LUTs are the same, with
the exception of the feedback wires, there can be a single
connection box to the routing network. Fig. 7 shows the
minimal structure of the PLB, which allows 2-input gates
with synchronization to be implemented.

The associated wiring is:

$$
\begin{aligned}
O_0 &= \mathrm{LUT6}(O_0, S_{in}, x_1, x_0, y_1, y_0) \\
O_1 &= \mathrm{LUT6}(S_{in}, O_1, x_1, x_0, y_1, y_0)
\end{aligned}
\qquad (3)
$$

In Eq. (3), each wire of x and y is loaded with exactly
the same number of inputs. Note that the $S_{in}$ signal is
connected twice to the routing network.

4) Balanced LUT Implementation: Figure 8 depicts
the LUT implementation scheme used to achieve the
objectives set in Section V-B. In a classical LUT
implementation with multiplexer trees, each input is loaded with
a different capacitance, and the logic depth from the
configuration bits to the outputs varies with the input
activity.

To circumvent these drawbacks, all input capacitances
have been balanced by adjusting the sizes of
the decoder's inverters, and a unique logic depth is imposed
between the configuration bits and the LUT output. More
details about the LUT implementation and simulation
results can be found in [8].

Figure 9 describes the actual PLB layout. The four
6→1 LUTs are placed symmetrically, and the input
multiplexers, which connect the LUT inputs to the routing
resources and feedbacks, are placed in between them. The
block P is placed at the top. All twelve inputs are
placed at the top, and the seven outputs at the right of
the PLB layout.

Figure 9. PLB layout in Cadence VIRTUOSO.



VII. Routing Architecture 

We use the traditional mesh routing architecture and
associated nomenclature as explained in [7].



A. Subset Routing Architecture 

A subset switchbox [7] can be built by repeating a
basic six-way switch-point along a diagonal, as shown
in figure 10(a). We consider that the diagonal formed
by the six-way switch-points makes up equitemporal
signals (see Section V-B) if these signals are outputs
of the same FPGA logic element (CLB). Figure 10(b)
shows the routing matrix using a subset switchbox.
Connection boxes from the equitemporal lines to the
CLB inputs/outputs are considered as equitemporal.
They are discussed in Section VII-D. In figure 10(b),
the dual pair signals corresponding to connections {A,
A'} and {D, D'} have exactly the same length and the
same electrical characteristics. The same goes for buses
{B, B'} and {C, C'}. Notice that the dual-rail signals
are not necessarily routed adjacently (case of
A and D) and that multi-wire signals can be routed in
the same fashion.

B. Twisted Pair Routing Architecture 

As a countermeasure against information leakage
through EM radiation, we propose to route every
n-rail signal as a twisted bus. Figure 11(a) shows the
advantages of using a twisted pair compared to parallel
routed wires. If we consider the twisted pair as made
up of several elementary radiating loops, we see that the
radiation from a loop is cancelled by that of adjacent
loops.

In addition to reducing compromising EM radiation
(outputs), the twisted bus gains immunity from its EM
vicinity (inputs). Consequently, twisting signal bundles
reduces cross-talk.

In order to route any n-rail signal as a twisted bus
throughout the FPGA, two novel switchboxes are
introduced in §VII-B1 and §VII-C.

(a) Subset switchbox using six-way switch-points.

(b) Equitemporal lines for subset switchbox routing.
Figure 10. Subset switchbox Routing.



1) Twist-on-Turn Switch Matrix: The basic idea
behind this switchbox is that every pair or n-uplet of
signals deflected by the switchbox must come out twisted.
As shown in figure 11(b), every ±π/2 bend through
this switchbox is a twisted pair. We can express this
switchbox using the notation described in [53] as:

$$
S = \bigcup_{i \in [0, W[} \left\{
\begin{aligned}
&[t(0,i), t(2,i)], \\
&[t(1,i), t(3,i)], \\
&[t(0,i), t(1,i)], \\
&[t(1,i), t(2,W-i-1)], \\
&[t(2,i), t(3,i)], \\
&[t(3,i), t(0,W-i-1)]
\end{aligned}
\right\}
$$

where each terminal is represented as t(j,i), where j
denotes the subset corresponding to each side (0 = left,
1 = top, 2 = right, 3 = bottom) and i ∈ [0, W[ denotes the
position of the terminal in that subset. Connection boxes
from the equitemporal lines to the CLB inputs/outputs
are considered as being equitemporal perpendicular to
the routing channel. They are discussed in Section VII-D.
In figure 11(b), the dual pair signals corresponding to
connections {A, A'} and {D, D'} have exactly the same
length even if they cross at the switching box. The
same holds for buses {B, B'} and {C, C'}.

When turning, this switch matrix introduces a small
imbalance in the arrival time at the deflecting switch
point. If the switch point is implemented with passive
gates, this balance violation is not observable by an
attacker. The counterpart is that the channels must be
buffered, which can safely be done with active gates,
because every wire in a channel is equitemporal.

(a) Electric & magnetic fields orientation in an un-twisted (a) and in a
twisted (b) pair. Legend: ⊗ vector into the page, ⊙ vector out of the page.

(b) Equitemporal lines for the twisted-pair switchbox.

Figure 11. The twisted-pair switchbox.

(a) Twist-on-Turn.

(b) Twist-Always.

(c) Twist-always switch matrix layout scheme.

Figure 12. The twisted-pair switchboxes.



C. Twist-Always Switch Matrix 

The twist-on-turn matrix does not twist buses when
they are routed straight. This matrix can be transformed
into a twist-always matrix by twisting wire i with
wire W − 1 − i for straight connections, as shown in
figure 12, W being the number of channels.

This matrix allows the use of any 1-out-of-n
(asynchronous) style, as it is possible to twist a number of
lines greater than two.
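
To make the two switch matrices concrete, the following Python sketch enumerates their connection sets for a given W, using the terminal notation t(j, i) with j the side (0 = left, 1 = top, 2 = right, 3 = bottom) and i the position. The function names are ours; the twist-always set is our reading of the rule that straight connections join wire i to wire W − 1 − i, with the bends left unchanged.

```python
def twist_on_turn(w):
    """Connection set of the twist-on-turn switchbox for w wires per side."""
    connections = set()
    for i in range(w):
        connections.update({
            ((0, i), (2, i)),           # straight, left to right
            ((1, i), (3, i)),           # straight, top to bottom
            ((0, i), (1, i)),           # bend, left to top
            ((1, i), (2, w - i - 1)),   # bend, top to right
            ((2, i), (3, i)),           # bend, right to bottom
            ((3, i), (0, w - i - 1)),   # bend, bottom to left
        })
    return connections


def twist_always(w):
    """Same bends, but straight connections join wire i to wire w - 1 - i."""
    connections = set()
    for i in range(w):
        connections.update({
            ((0, i), (2, w - 1 - i)),   # twisted straight, left to right
            ((1, i), (3, w - 1 - i)),   # twisted straight, top to bottom
            ((0, i), (1, i)),
            ((1, i), (2, w - i - 1)),
            ((2, i), (3, i)),
            ((3, i), (0, w - i - 1)),
        })
    return connections


def fanout(connections):
    """Number of switch points attached to each terminal."""
    counts = {}
    for a, b in connections:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


if __name__ == "__main__":
    for name, box in (("twist-on-turn", twist_on_turn(4)),
                      ("twist-always", twist_always(4))):
        print(name, len(box), sorted(set(fanout(box).values())))
```

In both sets every terminal reaches exactly three others, one on each of the remaining sides, which is the same flexibility as a subset switchbox.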

This switchbox cannot be implemented with traditional six-way switch-points, even if the number of transistors remains the same. A possible implementation of the twist-always switch box is shown in figure 12(c). It can be laid out in silicon with two interconnect layers and by repeating two basic patterns over space. Note that for straight (e.g. from left to right) connections, the outer rails are drawn wider than the inner rails to compensate for the difference in lengths. Alternatively, every wire can keep the same nominal width, but the inner rails are forced to zigzag so as to make up for their shorter length. For bends, every rail traverses an equal distance, hence this compensation is not required.

These new switchboxes are close to conventional universal/subset switchboxes in terms of connectivity. Hence we can expect similar performance in routability of netlists in the FPGA.

D. Connection Box Implementation 

1) Cross-Bar Connection Box: As depicted in figures 10(b) and 11(b), a signal routed from one equitemporal line to another has the same delay. Therefore the connection box (C-Box) between the $W$ channel wires and the CLB inputs/outputs $i \in [0, I[$ should also preserve this equitemporality. We propose to use a crossbar connection box based on balanced binary trees, built according to the following three rules: (i) from the channel, $W$ trees have $I$ equal-length branches; (ii) from the CLB, $I$ trees have $W$ equal-length branches; (iii) the two sets of trees are superimposed orthogonally and the $W \times I$ branches from each tree type meet via a switch point.

Figure 13(a) illustrates the layout of the balanced crossbar with $W = 4$ and $I = 4$, using only two metal layers (represented with two different thicknesses). The crossbar area is $W \cdot \lceil \log_2(I) \rceil \times I \cdot \lceil \log_2(W) \rceil$ square routing pitches, and it can be freely depopulated without altering its security level.
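As a worked example of the area expression for the balanced crossbar, the following sketch evaluates it for given $W$ and $I$. The function names are hypothetical; the ceiling of the base-2 logarithm is computed with integer arithmetic to avoid floating-point rounding.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()


def crossbar_area(w, i):
    """Return (width, height, area) of the balanced crossbar in routing pitches."""
    width = w * ceil_log2(i)
    height = i * ceil_log2(w)
    return width, height, width * height


# The W = 4, I = 4 crossbar of figure 13(a) occupies 8 x 8 routing pitches.
example = crossbar_area(4, 4)
```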

2) C-Box for Subset & Twisted-Pair Switch Matrix: The equitemporal lines are either diagonal (for the subset switch matrix, cf. Fig. 10(b)) or horizontal/vertical (for the twisted switch matrix, cf. Fig. 12(c)). The connections between the channel and the crossbar should compensate for the wire length delays. A solution for both cases is illustrated in figure 13.

Example layouts and statistics of the T-pair switchbox and the binary-tree connection box can be found in [11].

VIII. Single Driver Architecture 

Single driver segments have been shown to give better delay performance and a better area-delay product for an FPGA [20], [22]. These benefits result from lighter loading of the interconnect segments and from the availability of an equal number of tracks in each direction, as in the traditional bidirectional tri-state (bidir-tri) routing architecture.



Figure 14(a) shows how the subset switchbox layout scheme of figure 10(a) can be ported to a single driver architecture. Note that the connection box nets are routed as a binary tree in both the X and Y directions. Figure 14(b) shows the layout scheme for the T-pair switchbox with single driver interconnects. Note that this scheme also uses a basic switch-point as in figure 12(c), which is rotated as required.

[Figure 13. Balanced crossbar and connections between channel wires and crossbar. (a) Balanced crossbar for the connection box. (b) For Subset. (c) For T-pair.]



IX. Configuration Architecture 



Figure 15 describes the configuration architecture for our asynchronous FPGA "SAFE". The configuration chain is designed assuming a 1-out-of-2 4-phase protocol (as described in figure 1(c)). It consists of a series of full buffers terminated by initialisation circuitry. The function of the initialisation circuitry is to bring the whole chain to the ('0','0') state, and to clear any invalid state ('1','1') after power-up, or to erase a previous configuration. For initialisation, the signal INIT is set to '0', and (config-in-0, config-in-1) are held at '0'. The '0's from the configuration input propagate throughout the chain, performing a reset. For any invalid state ('1','1') after power-up, the output of the NOR gates in the chain will be '0', so the invalid state will be overwritten by the ('0','0') from the input of the chain. Since the signal INIT is held at '0' during the initialisation phase, the acknowledge input to the last stage is '0' for

(config-out-0, config-out-1) = (0,0) or (0,1) or (1,0).

It will only accept a ('0','0') from the previous stage.

During normal operation, INIT, and consequently the acknowledge input to the last stage, is held at '1'; hence the chain will accept any valid state until the chain is full.

[Figure 15. The asynchronous configuration chain with reset circuitry.]

[Figure 16. If the LUT inputs are not defined, there could be a temporary short-circuit. Due to this bug the LUTs are not usable in the FPGA; the pass transistors need to be replaced by buffers.]

Figure 17(a) shows the mixed-signal simulation results for an asynchronous configuration chain with five full-buffer stages and an initialisation cap, as explained in fig. 15. During the initialisation phase, the input to the configuration chain, CONFIG_IN, and the INIT signal are forced to '0'. We can see that the ('0','0') propagates along the configuration chain, and the acknowledge signals are initialised to the '1' (READY) state. We can also see that even if the configuration signals power up in an invalid ('1','1') state, the reset function of the initialisation cap brings the chain to a valid precharge state which is ready to accept a new configuration. In the configuration phase the INIT signal is kept at '1', and we can see in fig. 17(a) the configuration bits advancing through the chain. The waveforms presented are from a mixed-signal simulation, where the five full-buffers are analog, while the initialisation cap and the testbench controlling CONFIG_IN are digital circuits. For this reason the signals ACK and ACK_FORM_CAP at the two ends of the chain have different delays.



In Figure 17(b) we present the simulation results of the configuration of a single 6-input LUT with parasitic RC, to determine the highest achievable configuration speed. We used STARRCXT for parasitic extraction and ADVANCE MS (a tool from Mentor Graphics [1]) for simulation. The waveforms show the digital input to the configuration chain and the acknowledgement (ACK) from the chain, which is an analog circuit with parasitic RC. The rise time ($t_r$), fall time ($t_f$) and delay of the virtual analog-to-digital converters used for simulation are kept very small (about 0.1 ps) so that they do not affect the simulation results.



[Figure 14. Porting the solution to single driver architecture. (a) Single Driver Subset Switch-Box. (b) Single Driver Tpair Switch-Box.]

[Figure 17. Simulation of the asynchronous configuration chain. (a) The initialisation and configuration phases. (b) The maximum speed of the Configuration Chain (1.6 GHz), simulated with RC extracted view.]

From fig. 17(b) we can see that we can configure 64 bits in about 40 ns. Thus the maximum achievable configuration speed is around 1.6 GHz. This high speed is particular to the asynchronous configuration chain. It is mainly due to the fact that each configuration stage output sees a very small capacitive load, since it has a fanout of 2: one for the next stage and one for the switch connected to the output (see fig. 15). Although this is also true for synchronous shift register chains, their speed is often limited by the clock tree skew. The reader might note that there is a small difference in the acknowledgement (ACK) delay for CONFIG_DATA_0 and CONFIG_DATA_1. This is due to the fact that in our FPGA the switches are connected only to CONFIG_DATA_1, which thus has a bigger delay than CONFIG_DATA_0. This is also particular to the asynchronous configuration chain, because in the synchronous case we would have to use the worst-case delay to design the clock tree.
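The configuration speed quoted from the simulation follows from simple arithmetic on the two figures given in the text (64 bits, about 40 ns). A minimal sketch, with hypothetical variable names:

```python
# Figures taken from the simulation described in the text.
bits_configured = 64
elapsed_seconds = 40e-9

# Average number of configuration bits accepted per second.
rate_hz = bits_configured / elapsed_seconds

# Average time taken by one complete handshake (one bit).
period_seconds = elapsed_seconds / bits_configured

print(f"rate   ~ {rate_hz / 1e9:.2f} GHz")
print(f"period ~ {period_seconds * 1e9:.3f} ns per bit")
```

This is an average over the 64-bit window; it says nothing about the spread between individual handshakes, such as the ACK delay difference between the two rails.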



Table III explains the count of configuration bits in our prototype. Note that the following formulas are used in counting the switches of the routing resources:



Table III
No. of Total Configuration Bits.

SubModule             Qty.   Switch Count         Total
PLB                   9      287                  2583
PLB Connection Box    9      (12 + 7) x 8         1368
IO Connection Box     12     (3 + 3) x 8 x 0.5    288
IO Config Bits        12     3 x 12               36
Switchbox (Full)      4      6 x 8                192
Switchbox (1/2)       8      3 x 8                192
Switchbox (1/4)       4      1 x 8                32
Total                                             4691
[Figure 18. Prototype. (a) Chip micrograph. This chip was fabricated in CMP run S65C8_1 in Sept. 2008; the date shown on the photo is due to a camera malfunction.]



• No. of switches in a switchbox with $W$ channels and $N$ sides $= \binom{N}{2} \times W$.

• No. of switches in a connection box with $N_I$ inputs, $N_O$ outputs and $W$ channels $= (N_I + N_O) \times W \times F_c$.

In our case $W = 8$; for PLBs $N_I = 12$ and $N_O = 3$, and for IOBs $N_I = 3$ and $N_O = 3$. For PLB connection boxes $F_c = 1$, but for IOB connection boxes $F_c = 0.5$.
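The counting rules can be checked against Table III with a short script. This is only a sketch: the helper names are hypothetical, the PLB connection box row uses the (12 + 7) term as it is printed in the table, and the IO configuration bits are taken as 3 x 12.

```python
from math import comb

W = 8  # channel width of the prototype


def switchbox_switches(sides, width=W):
    """One switch per pair of sides and per channel: C(N, 2) * W."""
    return comb(sides, 2) * width


def connection_box_switches(n_in, n_out, width=W, fc=1.0):
    """(N_I + N_O) * W * F_c, with F_c the connection box flexibility."""
    return int((n_in + n_out) * width * fc)


rows = {
    "PLB": 9 * 287,
    "PLB Connection Box": 9 * connection_box_switches(12, 7),
    "IO Connection Box": 12 * connection_box_switches(3, 3, fc=0.5),
    "IO Config Bits": 3 * 12,
    "Switchbox (Full)": 4 * switchbox_switches(4),
    "Switchbox (1/2)": 8 * switchbox_switches(3),
    "Switchbox (1/4)": 4 * switchbox_switches(2),
}
total_bits = sum(rows.values())

for name, count in rows.items():
    print(f"{name:20s} {count:5d}")
print(f"{'Total':20s} {total_bits:5d}")
```

The switchbox rows correspond to switchboxes with four, three and two usable sides (interior, edge and corner positions of the array).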

X. Prototype 



Figure 18(a) shows the 3 x 3 prototype asynchronous FPGA as delivered by the foundry. This chip has been fabricated using the STMicroelectronics 65 nm 7-layer process. The FPGA channel width is 8, and there are 9 I/Os on each side.

This FPGA has been laid out using an automatic flow as described in [9]. The balanced place and route has been obtained through the following steps:

• We first placed the switches, in a symmetric fashion 
as outlined in the layout schemes. 

• All channel segments are manually routed to 
achieve required balance. 

• Configuration memory points and signals are 
placed/routed automatically, because those re- 
sources are not sensitive. 

Since the size of the FPGA depends mainly on the size of the switches and configuration memory points, and is not limited by the routing area, incorporating balanced routing does not result in an area overhead.

The prototype occupies an area of 1111.6 µm x 947.6 µm in silicon and contains approximately 200,000 transistors.



Figure 19. The test setup with Altera DE2 board and a small PCB 
containing our Prototype. 



XI. Experiments 



We have carried out two experiments. Figure 19 shows 
the experimental setup. We used a DE2 board from 
Altera to provide the test signals and acquisition of 
response from the prototype FPGA. The synchronous- 
to-asynchronous converters are implemented in the DE2 
board. 

A. Experiment 1: Configuration 

The sequence for this experiment is the following: 

• First the asynchronous configuration chain inside 
SAFE is initialised, by putting the INIT input of 
SAFE to '0' and then released after some time. 

• The bitstream file generated by VPR O is loaded 
onto the driver board (an Altera DE2) RAM 
from a PC. This bitstream is then converted to
asynchronous 1-out-of-2 coding and sent to the
configuration-0 and configuration-1 input signals of
SAFE.

• The DE2 Board monitors the acknowledge output 
from SAFE, and puts a new value in the config- 
uration chain following 4-phase handshake. It also 
counts the number of acknowledges received. For 
a successful configuration it should receive exactly 
4692 acknowledgments. 
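The conversion and acknowledge counting performed by the driver board can be sketched in software. This is a simplified model under stated assumptions, not the DE2 implementation: the rail convention (rail 0 high for a '0', rail 1 high for a '1') is assumed, the receiver is idealised and acknowledges every valid codeword, and all names are hypothetical.

```python
SPACER = (0, 0)  # return-to-zero phase of the 4-phase protocol


def encode_1_of_2(bit):
    """Return (rail0, rail1); exactly one rail is high for a valid bit."""
    return (0, 1) if bit else (1, 0)


def four_phase_stream(bits):
    """Yield the rail values driven for each bit: codeword, then spacer."""
    for bit in bits:
        yield encode_1_of_2(bit)
        yield SPACER


def count_acknowledges(bits):
    """Count the acknowledges returned by an idealised receiver."""
    acks = 0
    for rails in four_phase_stream(bits):
        if rails != SPACER:
            acks += 1  # one acknowledge per valid codeword
    return acks


bitstream = [0, 1, 1, 0]
acks = count_acknowledges(bitstream)
```

In this model the number of acknowledges always equals the number of bits sent, which is the property the driver board checks to decide whether a configuration succeeded.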

We observe that when the bitstream contains only '0's, the configuration succeeds every time and at very high speed; we tested up to 50 MHz, the DE2 board frequency. When the bitstream contains '1's, the configuration only succeeds at a low speed (around 10 kHz).

Although, thanks to the asynchronous coding, '0's and '1's are equivalent in terms of transitions, we think that this behaviour might be due to a bug, which we try to explain with simulations in appendix A. However, it does not hinder the test in real silicon of the routing strategy presented in Sec. VII.

[Figure 20. Measured trace of configuration signals. (a) The whole configuration trace of 4692 bits. (b) Zoomed view.]

B. Experiment 2: Measuring Hop Mismatch

The goal of this experiment is to measure the balancedness between the two rails constituting the dual rail of asynchronous logic. This is very important in terms of robustness against side-channel attacks, since any difference between these two rails will leak the data values to the attacker. The power-constant methodology depends entirely on this balancedness. Indeed, other biases, like the "early propagation effect" [45], do not exist in a "routing-only" netlist.

Figures 21(a) and 21(b) describe the experimental setup. We route a dual-rail netlist in the FPGA. The route for RAIL-1 of this dual rail stays constant over the experiment. The route for RAIL-0 is varied, incorporating differences of (0, 1, 3, 5, 7) hops w.r.t. RAIL-1, as shown in figure 21(b). We send 4 pulses each for RAIL-1 and RAIL-0, and we measure the power consumption. This measurement is done with a separate trigger signal encompassing the 4 + 4 pulses. Among the 4 pulses we choose a window for the comparison. Let $W_1(t)$ be the trace for RAIL-1 and $W_0(t)$ the trace for RAIL-0 within this window.

[Figure 21. Description of experiment 2, "Hop Mismatch". (a) Measuring the difference between rail-0 and rail-1 for 5 different lengths of rail-0. (b) Different pairs of dual rails to evaluate the effect of hop mismatch.]

The balancedness between these two traces is then calculated as:

$$\mathrm{Balance} = \frac{\mathrm{RMS}(W_0(t))}{\mathrm{RMS}(W_1(t))},$$

where RMS is the root mean square. This ratio captures the difference of energy contained in each side-channel curve.

The corresponding traces are shown in Fig. 19. We can see the triggering instants of the measurement. The traces for various hop mismatches and the balancedness values are indicated in figure 22. Figure 19 shows the traces for different hop mismatch values superposed in a staggered fashion for comparison purposes. They use the same color code as in fig. 21(b).

[Figure 22. The results for different values of Hop Mismatch (superimposed and staggered).]



Hop Mismatch    Balance
0               1.106
1               1.120
3               1.158
5               1.119
7               1.173



The results show that the hop mismatch definitely has 
an influence on the observed balancedness. This devia- 
tion of the balancedness from one indicates that a bias 
exists; such a bias is typically exploited by side-channel 
analyses, and therefore quantifies the implementation 
vulnerability. One also notes that the unbalance is not 
strictly speaking linear with the hop mismatch. 
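The balance metric used in this experiment, the ratio of the RMS values of the RAIL-0 and RAIL-1 traces within the comparison window, can be computed from sampled traces as in the following sketch. The sample values are synthetic and serve only as an illustration; the function names are hypothetical.

```python
import math


def rms(samples):
    """Root mean square of a sampled trace."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def balance(trace_rail0, trace_rail1):
    """RMS(W0) / RMS(W1); a value of 1.0 means perfectly balanced rails."""
    return rms(trace_rail0) / rms(trace_rail1)


# Synthetic example: RAIL-0 draws twice the amplitude of RAIL-1.
w1 = [1.0, -1.0, 1.0, -1.0]
w0 = [2.0, -2.0, 2.0, -2.0]
ratio = balance(w0, w1)
```

Both traces should be taken over the same window and the same number of samples, otherwise the ratio also reflects the window choice.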

C. A note on Variability 

The results regarding "hop mismatch" and "balance of power consumption" in subsection XI-B can be interpreted as the superposition of a linear component and random components. We can see that the power consumption of a dual rail is largely proportional to the hop mismatch, so there is a strong linear component in the power consumption of dual rails. However, there is also a non-linear component, as we can see from the result for "hop mismatch = 5". The authors suspect that this non-linear component is mainly due to variation. Notable causes of variation in deep-submicron CMOS are [46]:

• Systematic Process Variation 

• Layout Dependent Variation 

• Random Variation (Transistor Mismatch) [35]
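The split into a linear component and a remainder can be made concrete with an ordinary least-squares fit of the measured balance values against the hop mismatch. The sketch below uses the five values reported in subsection XI-B, taking the first measurement as a hop mismatch of 0 (an assumption based on the list of mismatches used in the experiment). The fit is only an illustration of the interpretation, not an analysis performed by the authors.

```python
hops = [0, 1, 3, 5, 7]
measured = [1.106, 1.120, 1.158, 1.119, 1.173]


def least_squares(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(hops, measured)

# Residuals: the part of each measurement not explained by the linear trend.
residuals = [m - (slope * h + intercept) for h, m in zip(hops, measured)]
worst_hop = max(zip(hops, residuals), key=lambda pair: abs(pair[1]))[0]

for h, r in zip(hops, residuals):
    print(f"hop mismatch {h}: residual {r:+.4f}")
```

With these values the fitted slope is positive and the largest residual occurs at a hop mismatch of 5, the point singled out in the text.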



[Figure 23. Dual Rail routing Example.]



In this article we tried mainly to minimize the Layout 
dependent variation, and systematic process variation can 
be reduced by using matched transistor pairs. However 
the ever increasing random component of variation (with 
technology scaling) could still be exploited by the at- 
tacker, and probably constitutes a fundamental limit to 
the anonymity from side-channel attackers. 



D. Secure Dual Rail Routing 

As we have seen in the previous subsections, there is a strong linear component in the relation between hop mismatch and the power consumption balance between dual rails, so we should guarantee balanced dual-rail routing in the CAD tools. We devised a simple dual-rail routing algorithm where the routing tracks in the FPGA are divided into two domains, and each rail is then routed in a separate domain. Because the domains are symmetrical, the routes for both rails are the same. However, this will work only for homogeneous architectures, such as those with unit-length tracks and a subset switchbox. In table IV we show the results of dual-rail routing for some netlists of the QUIP [4] benchmark suite in a simple uniform architecture. The benchmarks are WDDL implementations of the netlists. As shown in table IV, we can guarantee a zero hop-mismatch routing with this technique. However, this is only valid for simple homogeneous architectures, and needs considerable modification to be used in commercial architectures.
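The two-domain idea can be illustrated with a small sketch. It does not implement the routing search itself; it only shows how a route found for one rail in the first domain is mirrored into the second domain, and why the hop mismatch is then zero. The route representation (a list of (x, y, track) hops) and all names are hypothetical.

```python
def split_domains(channel_width):
    """Divide the tracks of a channel into two equal, symmetric domains."""
    half = channel_width // 2
    return list(range(half)), list(range(half, channel_width))


def mirror_route(route, channel_width):
    """Copy a domain-0 route hop by hop onto the matching domain-1 tracks."""
    half = channel_width // 2
    return [(x, y, track + half) for (x, y, track) in route]


def hop_mismatch(route_a, route_b):
    """Difference in the number of hops between the two rails."""
    return abs(len(route_a) - len(route_b))


# Rail 0 is routed on track 1 of domain 0; rail 1 reuses the same path on
# track 5 of domain 1 (channel width 8).
domain0, domain1 = split_domains(8)
rail0 = [(0, 0, 1), (1, 0, 1), (1, 1, 1)]
rail1 = mirror_route(rail0, 8)
mismatch = hop_mismatch(rail0, rail1)
```

The mirroring relies on every track of domain 0 having an identical counterpart in domain 1, which is why the technique is restricted to homogeneous architectures.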
Table IV
Channel Width and Hop-Mismatch for Dual Rail Routing

                                Breadth First               Dual Breadth First
Netlist             No. Nets    Channel    Hop Mismatch     Channel    Hop Mismatch
                                Width      /Dual-rail       Width      /Dual-rail
barrel16_wddl       626         15         2.89             16         0
barrel32_wddl       1482        20         2.78             20         0
barrel64_wddl       3254        23         4.41             22         0
mux32_16bit_wddl    2964        12         0.83             12         0
mux64_16bit_wddl    5854        14         0.51             14         0
mux8_128bit_wddl    5932        11         2.66             12         0
mux8_64bit_wddl     2988        10         2.25             10         0
xbar_16x16_wddl     706         14         1.59             14         0

XII. Conclusion 

In this article we discussed the suitability of an asynchronous FPGA as a countermeasure to physical cryptanalysis, and as a prototyping device for such countermeasures. The intrinsic resistance of asynchronous circuits to fault injection, together with power-constant signalling, makes them good candidates for such countermeasures. Moreover, because of their reconfigurable nature, it is possible to incorporate dynamic countermeasures along with static countermeasures. We believe a practical solution must use both to achieve the highest level of security. We presented approximate models of power consumption and delay on which our countermeasures are based, and defined our objectives for static countermeasures.

Keeping the prototyping role in mind, we presented a multi-style asynchronous PLB and proposed a fine-grain routing architecture, so that any m-out-of-n coding and various asynchronous protocols can be mapped onto this architecture. As shown in various experiments throughout the article, power-constant logic-level protocols cannot succeed without balanced interconnects. Layout statistics, and experiments on the netlist extracted from a prototype FPGA, show the kind of balancedness in dynamic power consumption that can be achieved with the subset switchbox and the associated binary tree connection box. We also presented a new physical implementation of the FPGA switch box, called the T-pair switchbox, which provides indiscernibility in EM emission for the dual rails routed through it.

Although the solutions proposed in the actual layout assume a bidirectional FPGA interconnect, we showed how these solutions can be ported to other flavours of interconnect such as single-driver, both at the switchbox and connection box levels. Finally we provided some experimental results on a 3 x 3 prototype, although these were largely hindered by a bug in the circuit. Nevertheless, these test results will be of use to future designs. We profiled the power consumption for different values of hop mismatch and observed a clear dependence. The experiment on the extracted netlist shows the very high speed of the asynchronous configuration chain (about 1.6 GHz), and we also verified its functionality in silicon at a lower speed, again because of the bug, which will be corrected in future designs.

In this paper we mainly concentrated on layout-dependent (geometric) variations in dual rails, which are only one component of the variations in power consumption between two rails. However, the experimental results reveal significant other components in the power consumption balance. Hence, as future research directions, we would like to propose nullifying the effect of CMOS variation (by taking alternate routes in a random fashion) and of other sources of rail-to-rail variation. We would also like to stress the importance of CAD algorithms in physically securing the application, for example by automatically routing dual rails through the FPGA in a balanced way. These important issues are the main future research challenges.
APPENDIX

Due to the problems encountered during the configuration, as explained in subsection XI-A, we investigated the cause of this anomaly. We simulated a single 6-input LUT inside the PLB. The implementation of this LUT is explained in subsection VI-4.

In figure 16 we provide a more detailed view of the LUT configuration chain and switches. In actual silicon the memory points are connected to the LUT output through Transmission Gates (denoted as pass transistors in fig. 16). The LUT inputs are decoded in such a way that only one among these parallel-connected Transmission Gates is "ON", depending on the input value, and the corresponding memory point should appear at the output.

However, when the inputs go through a transition, there is a temporary short-circuit between the two Transmission Gates (TGs) concerned. Because of the bidirectional nature of TGs, this disturbs the configuration chain itself: one LUT memory point is written into another one.

We ran the same simulation with tri-state buffers instead of Transmission Gates; in this case the output changes according to the inputs and memory points. We think that this could be a possible cause of the anomaly during the configuration of the FPGA "SAFE". In actual silicon, during the configuration, the LUT inputs are not forced to '0' or '1'. If the inputs go through parasitic transitions during configuration, this will introduce the invalid state ('1','1') in the configuration chain, leading to random behaviour.

Further investigation and testing are going on at TIMA, Grenoble, so that the FPGA can be used for basic testing, and these simulation results will be taken into account during the next tape-out.
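The one-hot selection of a memory point, and the conflict that arises when two Transmission Gates are momentarily "ON" together, can be illustrated with an abstract logical model. This is not an electrical simulation and does not reproduce the authors' simulation setup; the names, the bit ordering and the use of `None` to mark a conflict are assumptions made for illustration.

```python
def decode_one_hot(inputs):
    """Decode LUT inputs (least significant bit first) into one-hot select lines."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return [1 if i == index else 0 for i in range(2 ** len(inputs))]


def lut_output(memory, select):
    """Return the selected memory value, or None if the output is contested."""
    on = [i for i, s in enumerate(select) if s]
    values = {memory[i] for i in on}
    if len(values) == 1:
        return values.pop()
    # No gate ON, or several gates ON with different stored values: the
    # bidirectional gates would short these memory points together.
    return None


memory = [i % 2 for i in range(64)]  # arbitrary 6-input LUT contents

# Normal operation: exactly one gate is ON.
select = decode_one_hot((1, 0, 0, 0, 0, 0))
normal = lut_output(memory, select)

# During an input transition two gates may overlap (here gates 0 and 1).
overlap = [0] * 64
overlap[0] = overlap[1] = 1
contested = lut_output(memory, overlap)
```

In the model the overlap is simply reported; in the circuit described above it is the bidirectional nature of the gates that lets one memory point overwrite the other.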



